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Abstract 

Motivation: 

RNA sequencing enables allele specific expression (ASE) studies that 
complement standard genotype expression studies for common vari- 
ants and, importantly, also allow measuring the regulatory impact of 
rare variants. The Genotype-Tissue Expression project (GTEx) is col- 
lecting RNA-seq data on multiple tissues of a same set of individuals 
and novel methods are required for the analysis of these data. 
Results: 

We present a statistical method to compare different patterns of ASE 
across tissues and to classify genetic variants according to their im- 
pact on the tissue-wide expression profile. We focus on strong ASE 
effects that we are expecting to see for protein-truncating variants, 
but our method can also be adjusted for other types of ASE effects. 
We illustrate the method with a real data example on a tissue-wide 
expression profile of a variant causal for lipoid proteinosis, and with a 
simulation study to assess our method more generally. 
Availability: 

MAMBA software: http : / /birch . well . ox . ac . uk/~rivas/mamba/ 
R source code and data examples: http://www.iki.fi/mpirinen/ 

Contact: 
matti.pirinen@helsinki.fi 
r ivas@ well . ox . ac . uk 



2 



Downloaded from http://biorxiv.org/on September 18, 2014 



1 Introduction 

Advancements in sequencing technologies are enabling unprecedented possi- 
bilities to study the transcriptome. In RNA-sequencing studies, it is possible 
to distinguish between transcripts from the two haplotypes of an individ- 
ual using heterozygous sites. This approach, called allele specific expression 
(ASE) analysis, allows an alternative way to quantify cis-regulatory varia- 
tion, complementary to eQTL analysis. Additionally, ASE has been utilized 
to analyze transcriptome effects of nonsense-mediated decay triggered by pre- 
dicted loss-of-f unction variants (MacArthur et ai, 2012; Montgomery et ai, 
2011; Lappalainen et ai, 2013). 

The Genotype-Tissue Expression (GTEx) project is establishing a re- 
source database and tissue bank for the scientific community to study the 
relationship between genetic variation and gene expression in human tis- 
sues (GTEx-Consortium, 2013), with an aim to interpret GWAS findings 
for translational research. The project is analyzing gene expression from 
various perspectives, including transcript structure, expression quantity and 
diversity, eQTLs, and allele specific expression differences. Furthermore, as 
medical genetics pursues exploration of rare variants, insights gained from 
the study of DNA and RNA sequencing data in the GTEx project will be- 
come important for functional interpretation of rare variants (Rivas et ai, 
2011, 2013; Zuk et al, 2014). 

To date, some methods have been proposed for the analysis of allele 
specific expression data, but these methods largely focus on a single tis- 
sue (Ronald et al, 2005; Zhang et al, 2009; Degner et al, 2009; Sun, 2012) 
although some could be applied also to multiple tissues (Skelly et al, 2011). 
The multi-tissue aspect is important for interpreting disease association find- 
ings (Grundberg et ai, 2012; McCarroll et ai, 2008) since eQTL and early 
ASE studies suggest widespread tissue specific effects of cis-regulatory vari- 
ants (Dimas et ai, 2009; Gutierrez- Arcelus et ai, 2013). Currently, more 
sophisticated methods for cross-tissue eQTL analysis are emerging (Flutre 
et ai, 2013). However, eQTL analysis requires large sample sizes, while ASE 
analyses can be conducted with significantly smaller data sets. 

In this paper we present novel statistical methods for analyzing ASE 
patterns from RNA-seq data across multiple tissues. Our main motivation 
are phenomena such as nonsense-mediated decay that are expected to lead 
to strong ASE where one of the alleles is expressed only a little if at all. 
From a statistical point of view an advantage of strong ASE is that it can 
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be studied even with a single individual and without very large read counts. 
(See Middle panel of Fig. 4 for an example data set.) 
We address three main questions: 

• In which tissues does a heterozygous site show ASE? 

• Which tissues show similar ASE effects at the site studied? 

• What proportion of a certain class of variants (such as, e.g., protein 
truncating variants) show ASE in all tissues, only in some tissues, or 
in no tissues? 

For the first question, a standard frequentist version of the binomial test is 
commonly used. However, an interpretation of such a test depends on fac- 
tors like the read count and simultaneous analysis of multiple tissues is very 
challenging. Hence, we require that our statistical framework allows a simul- 
taneous comparison between several cross-tissue models for observed data. 
For example, we want to weigh relative support of the model where all tissues 
show ASE to the models where only a single tissue shows ASE and to the 
null model where none of the tissues show ASE. For this purpose, we adopt a 
Bayesian model comparison framework. Among its favorable properties are 
natural ways to compare models with differing number of parameters and 
to fully account for the amount of available data when evaluating relative 
support of the models. 

2 Methods 

2.1 Grouped tissue model (GTM) 

We consider RNA-seq read counts overlapping a particular genomic position 
from multiple tissues of one individual who is heterozygous at that position. 
(See the Discussion and Supplementary Information for extensions to mul- 
tiple sites and individuals.) For tissue s = 1, . . . ,T, let y sl and y s2 be the 
number of reads supporting the reference and non-reference allele, respec- 
tively, and let n s = y s \ + y s2 - We classify the tissues into three groups: no 
ASE (AT), moderate ASE (M.) and strong ASE (S) and denote the group 
of tissue s by 7 S 6 {Af, Ai,S}. (See Middle panel of Fig. 4 for an exam- 
ple data set that motivates us to use the three groups chosen.) For each 
group G G {M, Ai,S}, we denote the proportion of the transcripts with the 
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reference allele by 9(G). We use a binomial sampling model for the data 
conditional on the group indicators: 



Possible expression states of the tissues differ in the prior assumptions about 
parameters #(■). We use the following priors (Fig. 1) to describe different 
groups: 



Under no ASE model M both alleles are expressed (almost) equally and 
hence 9(M) ~ 0.5. As in Skelly et al. (2011), our M model allows small 
deviations from exact point value of 0.5 in order to be robust against some 
technical measurement bias as well as very small ASE effects that are not a 
main focus in this study. The parameters for Beta distributions have been 
chosen in a way that clearly separates the three groups from each other 
(Fig. 1) and thus gives an informative framework to classify the tissues into 
three groups (see the Supplementary Information for further discussion on 
the prior specification and extensions to truncated priors and independence 
across tissues). Our implementation allows user to easily modify the prior 
parameters as well as use one-sided versions of the prior distributions instead 
of the two-sided versions used in our simulation experiments. For example, 
when studying nonsense-mediated decay, we may require that the reference 
allele is expressed more strongly, and hence consider only the one-sided ASE 
states. On the other hand, the two-sided ASE states are tailored for the 
situations where we do not want to make an assumption about which allele 
might be dominating. This is useful, for example, when studying imprinting. 

Configurations. 

For fixed prior distributions, the model space consists of 3 T configurations 
represented by j = (71, . . . ,Jt) vector where each tissue specific indicator 




(1) 



d{N) ~ Beta(2000, 2000) 

9(M) ~ ^Beta(36, 12) + ^Beta(12,36) 




ls e{N,M,S}. 
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Figure 1: Densities of the prior distributions for the proportion of reference allele 
for the three groups: N, A4, and S. 



We partition the space of configurations into five ASE states: 



NOASE = 
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SNGASE = 
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for all s : 
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Is 


= N and j t ± M} 


HET1 = 
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exist s, t 


Is 


7^ 7t and for all u 



where the states represent configurations for no ASE (NOASE), moderate 
ASE (MODASE), strong ASE (SNGASE), heterogeneity with at least one 
tissue showing no ASE (HETO) and heterogeneity with all tissues showing 
some ASE (HET1). We also consider a tissue specific sub state of the het- 
erogeneity states: 

TIS_SPE = {7 I exist s, t : 7 S 7^ 7t and for all u ^ t : 7« = 7 S }. 

In order to do probabilistic comparison between the states we need to 
define prior probabilities for each state. We do this by defining prior for 
each configuration in a way which depends on its distance from homogeneity. 
We define a distance <i(7) for each configuration 7 as the smallest number 
of tissues whose group indicators need to be changed in order to turn 7 
into a homogeneous configuration (either NOASE, MODASE or SNGASE). 
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Formally, d(j) = T — max{£ N , £ M , £ s }, where CnAm and i$ are the number 
of tissues that 7 assigns to Af, M., and S, respectively. In particular, the 
configurations with d = 0 are the three homogeneous configurations and the 
configurations with d — 1 form the set TIS_SPE. 

We specify the total prior probability for each possible value of d = 
0,...,T — [T/3] and then distribute it equally among the configurations 
with the same distance. This prior allows us to easily implement the idea 
that among the vast space of configurations we favor a priori the parsimo- 
nious ones, i.e., those where many tissues are similar. Our prior distribution 
extends the one recently used for muti-tissue eQTL setting (Flutre et al, 
2013) to the case of more than two expression states. In the results reported 
in this work, we have set a prior mass of 0.75 to d — 0 (i.e., 0.25 for each of 
NOASE, MODASE and SNGASE) and the remaining 0.25 has been divided 
equally among all possible values of d — 1, . . . , T — [T/3] . This choice was 
made because it gives an equal prior weight for the four main patterns of 
ASE: three homogeneous states and general heterogeneity. 

For settings where many variants are available for a joint analysis, we 
extend GTM to a hierarchical model GTM* that learns from the data the 
proportion of variants belonging to each of the five states and thus avoids the 
need to fix prior probabilities of the states. We describe GTM* and compare 
it to GTM in the Supplementary Information. 



2.2 Computation 

We use a standard Gibbs sampler to explore the posterior distribution of the 
configuration indicators 7 under GTM (see the Supplementary Information 
for details). 

The basic building block for model comparison is the Bayes factor between 
configuration 7 and the NOASE state 

where y = (y\., . . . ,i/t-)- Evaluation of BF(7) can be done analytically by 
using Beta-binomial likelihood separately for the numerator and the denom- 
inator. Thus we can quickly evaluate the Bayes factor for any particular 
configuration and compare even hundreds of configurations. However, when 
the number of tissues is large (say T > 10), the number of possible configu- 
rations grows too large to be exhaustively evaluated (analogous to a problem 
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with eQTLs in Flutre et al. (2013)). This becomes problematic in particu- 
lar when assessing heterogeneity (either HETO or HET1), which in principle 
would require a consideration of all those groupings that assume differences 
between some tissues. Therefore, we introduce an approximation that avoids 
enumerating all groupings when assessing heterogeneity, by focusing on only 
configurations that are strongly supported by the data. Thus we assume 
that the configurations with d > 1 that have not been visited by the Gibbs 
sampler and are not among a few dozen top heterogeneity models defined 
by tissue specific group membership probabilities, have negligible marginal 
likelihood and can be ignored. This leads to a lower bound for the marginal 
likelihoods of the heterogeneity states. In practice, a comparison to the exact 
values for cases where exact values can be calculated (T < 10) shows that 
the lower bound is actually a good estimate in most cases (see Results). 

By combining the Bayes factors with the prior probabilities of the states 
we have the posterior probabilities of the states. 

2.3 Q-statistic for heterogeneity 

We compare GTM to a standard heterogeneity measure 



where 9 S = y s i/n s and 9 are the empirical proportion of the reference allele at 
tissue s and across all tissues, respectively. (If y s i = 0, we set y sl = 0.5 and 
Vs2 — ~ 0.5 to avoid numerical problems.) The idea is that this measure 
increases with the heterogeneity of the tissue specific 8 S parameters, and can 
thus be used as a measure of heterogeneity. An empirical P-value of the 
Q-statistic is estimated by simulating data sets where the number of tissues 
and reads match with the observed data and where all tissues have the same 
value for 9 S = 9. 

3 Results 

We first assess performance of GTM on simulated data and compare it with 
a standard heterogeneity measure. Second, we illustrate GTM on a real 
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Table 1: Scenarios for simulations. 





N 


M 


S 


STATE 


1 


T 


0 


0 


NOASE 


2 


0 


T 


0 


MODASE 


3 


0 


0 


T 


SNGASE 


4 


T-l 


0 


1 


HETO 


5 


1 


0 


T — 1 


HETO 


6 


0 


T — 1 


1 


HET1 


7 


0 


1 


T-l 


HET1 


8 


pr/2i 


LT/2J 


0 


HETO 


9 


0 


\T/2] 


LT/2J 


HET1 



For each of the nine scenarios, Table shows how many tissues (out of total T) 
belong to each of the three possible groups, J\f, M. and S, whose 6 parameters are 
0.5, 0.75 and 0.99, respectively. The STATE column shows to which of the five 
combined states each scenario belongs. 

data example taken from the Genotype-Tissue Expression (GTEx) project. 
Results of GTM* are presented in the Supplementary Information. 

Data simulation. For three values for the number of tissues (T = 
5, 10, 30) and two values for the number of reads per each tissue (n s = 10, 50, 
for all s), we simulated 1,000 data sets for nine scenarios given in Table 1. 
The reference allele read count for each tissue s was sampled from Bin(n s , 9 S ), 
where 9 S is 0.5 for group ftf, 0.75 for group M. and 0.99 for group S. 

GTM results. We applied GTM on the simulated data sets with 2,000 
Gibbs sampler iterations and show results in Figure 2. The run time was 8, 
16 and 53 seconds per data set, for T = 5, 10 and 30, respectively (Intel i7 / 
3.40 GHz). 

For the homogeneous scenarios 1, 2 and 3, the detectability of the true 
state increases with the number of tissues, and in general is already high with 
only 5 tissues and 10 reads per each tissue. In particular, when the true state 
is SNGASE (scenario 3), there is no noticeable uncertainty about the correct 
state in any of the data sets, while for NOASE (scenario 1) and MODASE 
(scenario 2) the data are not equally decisive. This is related to the fact that 
the variance in the read counts is larger for 9 = 0.5 (NOASE) and 9 = 0.75 
(MODASE) than it is for the extreme value of 9 = 0.99 (SNGASE). 
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Figure 2: Each of the nine simulation scenarios (Table 1) is represented by three 
numbers of tissues (5, 10, 30) and two values for number of reads (10, left column 
and 50, right column). Each bar is divided into five colors (map given at the 
bottom) according to the (average) posterior probability of the five states when 
GTM was applied to the simulated data sets. 
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Our scenarios 4, 5, 6 and 7 consider the smallest possible amount of 
heterogeneity: a single tissue is different from the others. In these scenarios 
we see an opposite trend from the homogeneous ones: the true heterogeneous 
state becomes harder to distinguish from the closest homogeneous one as the 
number of tissues increases. This behavior implements an idea that if one 
tissue seems to deviate from the others, we believe more in that distinction 
if only a couple of other tissues are analyzed, than if there are tens of other 
tissue, that are all similar to each other. Information about the true state 
increases with the number of reads per tissue, and the overall heterogeneity 
probability (HETO + HET1) is high for all heterogeneous scenarios for read 
count 50. 

The scenarios 8 and 9 represent stronger heterogeneity where about half 
of the tissues belong to one group and the remaining half to another. When 
the true underlying groups are Af and M. (scenario 8), then 10 reads is 
not yet enough to clearly separate the true pattern from the homogeneous 
states (NOASE and MODASE), while 50 reads is enough for this purpose. 
In the scenario 9, where tissues are divided between A4 and S, the overall 
heterogeneity probability is always fairly large, but it is difficult to exclude 
the possibility that at least one tissue belongs to Af, especially when many 
tissues are available. As a consequence, the HETO state gets a considerable 
probability even though the true underlying state is HET1. To distinguish 
between the two heterogeneity states in this scenario requires read counts 
larger than 50. 

Taken together, the results in Figure 2 show that the model is correctly 
able to distinguish between all five combined states but an amount of infor- 
mation required for accurate classification varies between scenarios. 

Comparisons with Q-statistic. To show differences between our GTM 
heterogeneity probability and Q-statistic in detecting heterogeneity we show 
ROC curves for two settings (Fig. 3). In the first one, we use the 1,000 
data sets from simulation scenario 1 to represent a homogeneous state and 
the 1,000 data sets from scenario 4 to represent a heterogeneous state, with 
T = 30 tissues and n = 10 reads per tissue. The ROC curve on the left panel 
of Figure 3 shows how those 2,000 data sets are ranked by the posterior 
probability of HET0+HET1 state from GTM and by the empirical P-value 
of Q. The right panel in Figure 3 shows similar results when comparing the 
homogeneous scenario 3 to the heterogeneous scenario 7. In both settings 
GTM is slightly better in detecting heterogeneity than the Q-statistic: ROC 
curve of GTM is consistently above that of Q. We discuss the difference 
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SCENARIO 4 vs SCENARIO 1 SCENARIO 7 vs SCENARIO 1 




FALSE POSITIVE RATE FALSE POSITIVE RATE 

Figure 3: ROC curves for detecting heterogeneity using GTM and Q-statistic on 
simulated data sets. Left: scenario 1 (homogeneity) vs scenario 4 (heterogeneity). 
Right: scenario 3 (homogeneity) vs scenario 7 (heterogeneity). Parameters are 
T = 30 tissues and n = 10 reads per tissue. Heterogeneity statistics are the 
posterior probability of HET0+HET1 from GTM and the empirical P-value from 
the Q-statistic. 



between the two approaches in Discussion. 

Accuracy of approximation. To assess how accurate our approxima- 
tion for marginal likelihoods of the heterogeneity states is, we compared the 
posterior probabilities of the five states from GTM with the exact values 
on all 9,000 data sets simulated with T = 10 tissues and n = 10 reads per 
tissue. As a measure of accuracy we use the total variation distance (TV), 
which for discrete distributions (p*) and (<&) is defined as \ £V \pi — q%\. TV 
describes how much of the probability mass needs to be relocated in order 
to turn the first distribution into the other, and also gives an upper bound 
for the maximal difference in probability that the two distributions assign to 
any one state. 

We observed (Table 2) that 95% of the data sets had TV < 0.05, and 
0.5% had TV > 0.10, with maximal TV being 0.168. As TV less than 0.10 is 
unlikely to change our inference on the underlying state, our approximation 
works well for a great majority of the data sets tested. However, there 
are large differences in the accuracy between the scenarios with scenario 8 
(5 tissues in M and other 5 in M) showing the strongest discrepancy. An 
explanation for this is that with only 10 reads per tissue the scenario 8 assign 
non-negligible posterior probability to so many of the configurations showing 
heterogeneity between M and Ai groups that some of them are missed during 
our default number of 2,000 Gibbs sampler iterations. 

In Discussion we propose an additional heterogeneity measure to comple- 
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Table 2: Accuracy of GTM. 



SCENAR 


TV< 0.01 


TV < 0.05 


TV< 0.10 


max 


1 


0.462 


0.972 


1 


0.092 


2 


0.452 


0.959 


0.995 


0.129 


3 


1 


1 


1 


0.006 


4 


0.160 


0.986 


0.999 


0.105 


5 


0.997 


1 


1 


0.011 


6 


0.327 


0.973 


1 


0.070 


7 


0.994 


1 


1 


0.012 


8 


0.081 


0.663 


0.962 


0.168 


9 


0.565 


1 


1 


0.025 



For each of the nine scenarios, Table shows the proportion of simulated data sets 
(out of total 1,000 with T = 10 tissues and n = 10 reads per tissue) for which 
total variation (TV) distance between GTM estimates and the exact posterior 
distribution of the five states is less than 0.01, 0.05 or 0.10. The max column 
shows the maximum TV distance observed. 

merit our approximation for the posterior probability of heterogeneity states, 
especially for data sets with large number of tissues. 

3.1 A protein truncating variant in ECM1 

We consider read count data on SNP rsl21909115 in chromosome 1, whose 
non-reference allele (T) introduces a stop codon in some transcripts of the 
gene ECM1. This is an example of a protein-truncating mutation that we 
expect to experience nonsense-mediated decay (NMD) leading to a reduction 
in the transcripts from the non-reference allele. However, the strength of 
NMD and its consistency across different tissue types is unknown. In the 
currently available GTEx data we have one individual who is heterozygous 
at this SNP and Figure 4 shows the data and GTM results across 7 tissues. 

The results show that, as expected, most tissue types show ASE where 
the non-reference allele has lower read counts than the reference allele. In 
addition, there is evidence of heterogeneity between the tissue (p(HET0|y) = 
0.93) and that heterogeneity could result from a tissue specific effect where 
the skin tissue escapes ASE (p(TIS_SPE|y) = 0.24). None of the tissues 
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NOASE | 
MODASE |J 
SNGASE | 

HETO | ~~| 
HET1 ( 
TIS_SPE | ] 

I 1 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 

STATE PROBABILITIES 



o 




Figure 4: Data on rsl21909115. Top panel shows the posterior probabilities for 
six states as defined in Methods. The tissue specific state (TIS_SPE) is a sub- 
set of the heterogeneity states (HETO and HET1) and the probabilities of the 
other five states sum to one. Middle panel shows the point estimates of the non- 
reference allele frequency among RNA-seq reads across seven tissue types (named 
at the bottom) together with their 95% credible intervals. Bottom panel shows 
the posterior probability distribution of the group indicator (7^) for each tissue 
type, where white, gray and black denote groups A", M. and S, respectively. Tis- 
sue types: ARTTBL=Artery tibial, NERVET=Nerve tibial, ADPSBQ=Adipose 
subcutaneous, HRTLV=Heart ventricle, MSCSKL=Muscle skeletal, SKINS=Skin 
sun exposed. 
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shows strong evidence for complete ASE (group <S), but rather the other six 
tissue types (apart from skin) are likely to belong to either the moderate 
ASE group Ai, or to no ASE group (nerve and adipose). 

The non-reference allele (T) at the SNP rsl21909115 is one of the known 
protein truncating mutations in ECM1 that in homozygous carriers lead to 
lipoid proteinosis, also known as Urbach-Wiethe disease, (OMIM 247100), 
(Hamada et al., 2003). The symptoms of this disease include scarring and 
infiltration of skin and mucosae (Hamada, 2002). Therefore, it is an inter- 
esting observation that in our data the allele with the nonsense mutation is 
expressed more strongly in the skin tissue than in several other tissue types. 

4 Discussion 

We have introduced a statistical framework to assess similarities and differ- 
ences in allelic specific expression (ASE) between tissue types. A motiva- 
tion for our work comes from ongoing RNA-sequencing projects such as the 
GTEx project (GTEx-Consortium, 2013) that generates data on up to 30 
tissue types per individual with read counts per tissue and per site starting 
from around 10. 

We have chosen a Bayesian approach because it leads to a consistent 
probabilistic quantification of the support that the data provide for each of 
the competing models. We see this as an advantage over a series of separate 
analyses, such as, for example, would be needed by an approach that first 
assessed heterogeneity using the Q-statistic and if no (statistically significant) 
heterogeneity was observed, would further classify the data set into one of the 
homogeneous states. Previously, two excellent studies on Bayesian models 
for expression data have been published by Skelly et al. (2011) and Flutre 
et al. (2013). 

Skelly et al. (2011) consider a three-stage hierarchical model for allele read 
counts from genome-wide RNA-seq data of one individual. They observe al- 
lele counts at each heterozygous SNP (1st level) which are assigned to genes 
(2nd level) whose common properties are controlled by genome- wide parame- 
ters (3rd level). Their model would be directly applicable also to multi-tissue 
RNA-seq data where tissue specific allele counts replace SNP-specific allele 
counts and sites replace genes at the second level of the model. Suppose, 
for example, that we had multi-tissue RNA-seq data on a set of individuals 
who are heterozygous for at least one protein-truncating variant (PTV). The 
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model of Skelly et al. (2011) would produce posterior distribution on the 
global proportion of the PTVs that show ASE in at least one tissue type, 
as well as variant specific posterior probability of ASE. However, it would 
not give as refined characterization of ASE at each PTV as our hierarchical 
model GTM* (Supplementary Information) and it would not do inference on 
tissue specific group indicator parameters (7^) that allow probabilistic model 
comparison between different patterns of ASE. 

Flutre et al. (2013) developed a method for a joint eQTL analysis across 
multiple tissues. They work with micro-array expression data using a linear 
model that is not directly applicable to the data sets we have in mind: read 
count data from several tissues of a single individual. They use tissue spe- 
cific binary indicator parameters that tell whether a variant is an eQTL in 
each tissue and introduce three ways to assign prior probabilities to differ- 
ent configurations of the indicators. Their "lite" model gives positive prior 
on only those configurations whose distance from homogeneity is at most 1. 
Their "BMA" model is similar to what we have used in that the prior of a 
configuration depends only on its distance from homogeneity and that the 
total prior probability corresponding to each value of distance is the same. 
Finally, their most complex "BMA-HM" model treats the weights of the 
configurations as random variables and estimates them across genes using a 
hierarchical model. Similar hierarchical model, that would learn joint ASE 
patterns between tissues by a simultaneous analysis of a set of genetic vari- 
ants (e.g. PTVs) is also an important topic for further development of our 
hierarchical model GTM*. 

Even though hierarchical models, such as our GTM* and those of Skelly 
et al. (2011) and Flutre et al. (2013), are conceptually attractive, we believe 
that GTM's ability to analyze one PTV at a time has its merits from a 
practical point of view: it is quick to run, easy to understand and requires 
read count data on only one variant. 

Heterogeneity measures. 

Our model assumes that all tissues in a same group have the same reference 
allele read frequency 9. In practice, we expect that our model is robust to 
some heterogeneity within each group, since the priors for different groups 
are so clearly separable from each other (Fig. 1 and Supp Fig. 1). Our main 
interest is to assign tissue types into (three) broad categories of ASE, and 
consequently we call a data set heterogeneous only if some pair of tissues 
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show fairly strong difference in the non-reference allele frequency. Standard 
heterogeneity measures, such as Q, ask a slightly different question of whether 
parameters 9 S have exactly the same value across the tissues. A frequentist 
answer to this latter question is given by an empirical P- value of heterogeneity 
measures Q or p(HET|y) estimated under the null hypothesis that all tissues 
have the same value for 9 parameter which is estimated from the data. By 
this approach, heterogeneity P-values in the ECM1 example of Figure 4 are 
0.011 and 0.006 by using Q and p(HET0|y), respectively, as a test statistic. 
While these P-values point to general heterogeneity between the tissue types, 
our GTM analysis leads to more detailed information considering the type 
of heterogeneity: we see heterogeneity where at least one tissue type escapes 
ASE (our HET0 state). 

Our heterogeneity probabilities are based on a lower bound of the true 
marginal likelihoods of the heterogeneity states. We showed that in a large 
majority of our data sets (with 10 tissues) the approximation is accurate, 
but the approximation may not always work as well with larger number of 
tissues. Therefore, in addition to the heterogeneity probabilities, we also 
compute, for each pair of tissues, a posterior expectation of the distance 
between them. Here we define the distance between two tissues to be 0 if 
they belong to a same group and 1 otherwise. Maximum of all pairwise 
distances gives an indication whether there is heterogeneity between tissues. 
By further defining the distance between the ASE groups M. and S to be 
0, we have a measure for particular kind of heterogeneity (HETO) where at 
least one tissue belongs to the no ASE group M . 

4.1 Application of the Method 

Nonsense-mediated decay. We envisage that a primary application of 
our method will be in analyzing nonsense- mediated decay (NMD). Protein 
truncating variants are usually subject to NMD, a cellular mechanism that 
detects premature termination codons and prevents expression of truncated 
proteins. Integrated genome and transcriptome sequencing studies in lym- 
phoblastoid cell lines have demonstrated that allele specific expression can 
be used for testing variants predicted to trigger NMD (Montgomery et ai, 
2010; MacArthur et ai, 2012; Montgomery et ai, 2011; Lappalainen et ai, 
2013). To test whether the predicted NMD truly happens, we can use the 
one-sided ASE models rather than the two-sided ones to explicitly require 
that an ASE signal is present only if the minor allele, and not the major al- 
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lele, is silenced. We applied the one-sided approach in our example analysis 
of a PTV in ECM1. 

Imprinting. Genomic imprinting is a phenomenon where only an allele 
inherited from the parent of a particular sex is expressed (Babak et ai, 2008). 
To assess imprinting for one tissue type one could consider a common coding 
variant of the gene of interest in that tissue across multiple heterozygous 
individuals. If the gene is imprinted, then in about half of the individuals 
the allele 1 should be silenced, because we expect that in about half of the 
individuals the allele 1 at this locus is inherited from the mother and in the 
other half it originates from the father. We could extend our framework to 
such by allowing each ASE group to be divided into two subgroups 

with proportion parameters 9 and 1 — 9, respectively, where 9 has a Beta-prior 
as in our GTM model. 

Modest ASE effects. An interesting application of allele specific expres- 
sion is in the study of cis-regulatory variants in LD with coding variants. 
When regulatory variants have only modest effects (Dimas et ai, 2009), we 
could modify our prior on ASE states to reflect this. With this approach we 
envision that researchers are able to study the effect of czs-regulatory variants 
on transcription across a broad range of tissues where the number of samples 
per tissue may be limited. However, compared to strong ASE effects, more 
modest effects require much larger read counts per tissue and decrease the 
ratio between biological signal and possible technical noise (Degner et ai, 
2009). 

Multiple individuals. Suppose we have RNA-seq data on the same tissue 
types from several individuals who are heterozygous at a particular variant. 
We could first assess, for each tissue type, whether the individuals are het- 
erogeneous in their ASE status. If there is no evident heterogeneity, a simple 
approach is to combine the reads from the same tissue type across the indi- 
viduals before analyzing the data across the tissues. A more refined model 
that accounts for possible individual specific effects that are shared across 
the tissues requires further work. 
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5 Conclusion 

We have presented grouped tissue model (GTM) and its multi-site extension 
(GTM*), to (i) classify tissues into three groups at each site according to 
their allele specific expression (ASE) patterns and (ii) classify the sites into 
five combined ASE states according to their tissue-wide ASE profiles. We see 
major applications of our approach in assessing homogeneity, heterogeneity 
and tissue specificity across a group of genetic variants assumed to have 
similar properties, e.g. variants predicted to trigger nonsense-mediated decay, 
variants in genes with evidence of imprinting, or variants in LD with GWAS 
loci for a particular disease. 

As an example, we presented an application of the method to read count 
data from the GTEx project for one heterozygote carrier of a premature 
truncating mutation (p.R53X, rsl21909115) in the ECM1 gene. For this 
variant, which in homozygous form is known to cause lipoid proteinosis, we 
identified heterogeneous gene expression effects across tissues and evidence 
for a complete escape from nonsense-mediated decay in skin tissue. 

The identification and characterization of ASE, in particular for protein 
truncating variants with a putative complete loss of function effect, will pro- 
vide a better understanding of the biological mechanisms that are involved 
in transcriptional regulation and will improve our computational models for 
annotating variants identified in case-control or clinical genome sequencing 
studies. 
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